[IPv6crawler-wg] An important update about the IPv6 Matrix Project
Olivier MJ Crepin-Leblond
ocl at gih.com
Thu Nov 26 18:32:49 GMT 2015
Hello all,
Two worthy pieces of news regarding the IPv6 Matrix Project (
http://www.ipv6matrix.org ):
1. I have updated the Web site with the latest results ending in late
October - hence noting a Crawl display date of November 2015.
We now have 294 GB of comma-separated value (CSV) text data in store,
tracing the spread of the IPv6 Internet since July 2010.
Altogether, we ran the test approximately 36 times on all 1 million of
Alexa's busiest domain names. This represented testing of about 6.5
million hosts, carefully collecting traceroute information for each and
every one of them. We now have a unique database that shows the
spread of IPv6 Internet information sources worldwide.
2. Today I took out my very dusty Linux & Python gloves and performed a
much-needed update to the IPv6 Matrix Crawler input database, including
the Alexa 1 million list as well as the GeoIP databases.
Indeed, the Alexa database of the world's 1 million busiest Web sites
dated from the Crawler's inception in the first half of 2010.
We're now more than 5 years later!
In a way, keeping the same input database has kept the basis of the
crawls steady, which made it possible to compare results across runs.
However, the flip side of the coin is that we are ending up with more and
more domain names marked as dysfunctional. Nearly 5% of the domain
names in the database were unreachable. The updated input database
should resolve this, but we might also see a jump in some results. It
will be interesting to see what the next run yields.
Why do we not update the input database more often? Because buried in
that database are the domain names of the people who wanted to opt out
over the years. Having never thought about this, I spent several hours
tracing back 5 years of emails from people complaining about the crawl
triggering their firewalls. I put together a blacklist of domain names,
which I have manually deleted from the crawl input files.
The blacklist, as it stands now:
Deleted:
it-mate.co.uk
indianic.com
your-server.de
catacombscds.com
dewlance.com
tcs.com
printweb.de
nocser.net
shoppingnsales.com
bsaadmail.com
epayservice.ru
4footyfans.com
guitarspeed99.com
saga.co.uk
Already gone from the current Alexa list:
infinityautosurf.com
canada-traffic.com
usahitz.com
jawatankosong.com.my
4d.com.my
fitnessuncovered.co.uk
kualalumpurbookfair.com
xgen-it.com
bpanet.de
edns.de
back2web.de
waaaouh.com
every-web.com
w3sexe.com
gratuits-web.com
france-mateur.com
pliagedepapier.com
immobilieretparticuliers.com
chronobio.com
stickers-origines.com
tailor-made.co.uk
With these out of the input files, we are able to start the next crawl.
*I hope I have not missed any complaints, but if I have, this is advance
notice that we might receive a few emails in the forthcoming weeks. We
might also receive a few emails from sites that have appeared on the
Alexa 1 million list since 2010.*
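The blacklist removal described above was done by hand, but it could also be scripted. The following is only a minimal Python sketch, not the project's actual tooling. It assumes the input is in the "rank,domain" CSV layout of the public Alexa list, and the names BLACKLIST and filter_blacklisted are hypothetical. Dropping subdomains of a blacklisted name is also an assumption made here, not something stated in this message.

```python
import csv
import io

# Hypothetical excerpt of the opt-out list; the full list is given in the message.
BLACKLIST = {"it-mate.co.uk", "tcs.com", "saga.co.uk"}

def filter_blacklisted(rows, blacklist):
    """Yield (rank, domain) rows whose domain is not blacklisted.

    A domain is dropped if it equals a blacklisted name or is a
    subdomain of one (e.g. www.tcs.com when tcs.com is listed).
    """
    blocked = {name.lower().rstrip(".") for name in blacklist}
    for rank, domain in rows:
        name = domain.strip().lower().rstrip(".")
        labels = name.split(".")
        # The name itself plus every parent suffix of it.
        suffixes = {".".join(labels[i:]) for i in range(len(labels))}
        if suffixes & blocked:
            continue
        yield rank, domain

# Small in-memory sample standing in for the real input file (assumed layout).
sample = io.StringIO("1,google.com\n2,tcs.com\n3,www.saga.co.uk\n4,example.org\n")
kept = list(filter_blacklisted(csv.reader(sample), BLACKLIST))
print(kept)
```

Keeping the blacklist in a separate file and applying it on every refresh of the input list would avoid having to dig through old emails next time.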
Back to the Alexa list: the excellent filtering program which was used to
process and clean up the original list was used again for the new
list. In the past, the Alexa list contained a number of entries which
were actually sub-directories rather than domain names, as well as some
invalid domains. Alexa has since cleaned up its act, and the latest list
is much cleaner. It holds 999998 valid domains vs. 984587 for the
original 2010 list.
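For readers wondering what this kind of clean-up involves, here is a rough Python sketch. It is not the filtering program itself, whose rules are not described in this message. The function name clean_entry and the specific checks (no "/" path component, at least two labels, standard hostname label syntax, 253 character limit) are assumptions chosen for illustration.

```python
import re

# One DNS label: 1-63 characters of letters, digits or hyphens,
# with no leading or trailing hyphen.
LABEL_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")

def clean_entry(entry):
    """Return a normalised domain name, or None if the entry should be dropped."""
    name = entry.strip().lower().rstrip(".")
    if "/" in name:
        # e.g. "example.com/blog" is a sub-directory, not a domain name
        return None
    if not name or len(name) > 253:
        return None
    labels = name.split(".")
    if len(labels) < 2:
        # require at least "name.tld"
        return None
    if not all(LABEL_RE.match(label) for label in labels):
        return None
    return name

entries = [
    "Example.com",
    "example.com/blog",
    "bad_name.com",
    "-dash.org",
    "xn--80asehdb.xn--p1ai",
    "localhost",
]
cleaned = [d for d in map(clean_entry, entries) if d is not None]
print(cleaned)
```

Note that IDNs pass this check in their ASCII "xn--" form, which is how they appear in the list.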
Finally, new gTLDs have now appeared in the Alexa list, including some
Internationalised Domain Names (IDNs). The world is indeed a very
different place!
It will be interesting to see how the Crawler, as well as all the other
scripts that process the information into displayable data on the Web
server, will cope with these:
academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
app.csv          date.csv         horse.csv          parts.csv        team.csv
archi.csv        dating.csv       host.csv           party.csv        tech.csv
associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
bar.csv          direct.csv       investments.csv    place.csv        tools.csv
bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
best.csv         earth.csv        kim.csv            press.csv        training.csv
bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
black.csv        energy.csv       krd.csv            property.csv     uol.csv
blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
boo.csv          events.csv       legal.csv          recipes.csv      video.csv
boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
build.csv        exposed.csv      link.csv           reise.csv        wales.csv
builders.csv     express.csv      live.csv           reisen.csv       wang.csv
business.csv     fail.csv         lol.csv            ren.csv          watch.csv
buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
bzh.csv          farm.csv         love.csv           repair.csv       website.csv
cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
camp.csv         fishing.csv      mango.csv          review.csv       win.csv
capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
care.csv         flights.csv      markets.csv        rip.csv          works.csv
career.csv       foo.csv          media.csv          rocks.csv        world.csv
careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
club.csv         global.csv       nexus.csv          social.csv       xyz.csv
coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
community.csv    google.csv       nyc.csv            space.csv        zone.csv
company.csv      graphics.csv     office.csv         style.csv
computer.csv     gratis.csv       okinawa.csv        sucks.csv
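The file names above suggest one CSV file per top-level domain, with IDN TLDs stored in their ASCII "xn--" (punycode) form. As an illustration only, the Python sketch below shows how domains could be grouped that way and how an "xn--" TLD could be turned back into Unicode for display. The function names are hypothetical and the per-TLD grouping is inferred from the listing, not taken from the Crawler's code. Python's built-in "idna" codec implements the older IDNA 2003 rules, so the sketch falls back to the raw label if decoding fails.

```python
from collections import defaultdict

def tld_of(domain):
    """Return the last label of a domain name, lower-cased."""
    return domain.strip().lower().rstrip(".").rsplit(".", 1)[-1]

def display_tld(tld):
    """Decode an 'xn--' label to Unicode for display; return other labels unchanged."""
    if tld.startswith("xn--"):
        try:
            return tld.encode("ascii").decode("idna")
        except UnicodeError:
            # Leave the ASCII form in place if the codec rejects the label.
            return tld
    return tld

def group_by_tld(domains):
    """Map '<tld>.csv' file names to the list of domains under that TLD."""
    groups = defaultdict(list)
    for domain in domains:
        groups[tld_of(domain) + ".csv"].append(domain)
    return dict(groups)

# Made-up sample domains, used only to exercise the functions.
domains = ["example.academy", "test.xn--p1ai", "foo.london", "bar.academy"]
groups = group_by_tld(domains)
for filename in sorted(groups):
    print(filename, display_tld(filename[:-4]), groups[filename])
```

Keeping the ASCII form in file names and decoding only for display sidesteps file system and encoding surprises on the Web server side.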
In the meantime I'd like to credit again the Nile University Crew, expertly
led by Sameh El Ansary, for designing and coding a Crawler that has been
able to cope with sifting through 5 years of DNS junk with minimal
maintenance, save the love and attention I give the servers by keeping
them up to date with patches so they don't end up toppling over. They
haven't been rebooted in 464 days and I am crossing my fingers for their
well-being.
And of course, thanks to the University of Southampton Crew who built
the excellent 2nd version of the Web Site under Tim Chown's supervision.
I am still writing an article for RIPE Labs - just struggling to find
the time to finish it, but getting there.
Warmest regards,
Olivier