Hi everyone! How could I miss crawler 2.0 and post 3.0 before it? Here I am posting the 2.0 crawler with multiprocessing support. 😉 The thread-based 3.0 crawler was actually easier to develop, and now it is time for the final release of 2.0.
Why am I making a crawler? My friends Abhijeet and Zainab and I were thinking of building a basic search engine. We know there are already better ones than ours, but we thought we could do something useful with this crawler. And now one more guy has joined us: Mr Nirav, a highly skilled person who works on highly critical projects.
Now I am more confident of finishing all this in time and building an automatic system that will post everything new on bestindianwear.com. You could call it a basic AI (Artificial Intelligence) project. Abhijeet is working quite hard on it.
Thanks, guys. I do not feel alone, and your efforts make the journey enjoyable. Cheers to everyone, we will be finishing this soon… 🙂
Yes, as you read above, one of my colleagues asked me what the speed of a parallel, thread-based crawler would be, so I am posting this so you can all check out the speed for yourselves. Using it is very simple and is explained in the file itself. Check it out… 😉 Enjoy, and let me sleep now!
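For reference, the multiprocess idea from crawler 2.0 boils down to mapping a fetch function over a pool of worker processes. A minimal sketch (the helper names here are my own, not the ones in the posted file):

```python
from multiprocessing import Pool
from urllib.request import urlopen

def fetch(url):
    """Download one page; return (url, byte count), or (url, -1) on any error."""
    try:
        with urlopen(url, timeout=10) as resp:
            return url, len(resp.read())
    except Exception:
        return url, -1

def crawl_parallel(urls, workers=4):
    """Fetch many URLs at once, one worker process per core."""
    with Pool(workers) as pool:
        return pool.map(fetch, urls)
```

Calling `crawl_parallel(list_of_urls)` from a script's `if __name__ == "__main__":` block fans the downloads out across all cores.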
Crazy day! I indexed a 30GB file with 53 million lines of JSON data into Elasticsearch. Then I tried Kibana on it, which was really enjoyable with a drink in hand. Kibana is at shivalink.com:5601.
Elasticsearch is at shivalink.com:9200.
The toughest part was unzipping the 5GB bz2 file using all cores. I tried pbzip2, but it didn't work in my case. Then I found `lbzip2 -d myfile.json.bz2`, which was really fast and used all my cores efficiently. Decompressed, the file turned out to be 30GB. Then came inserting it into Elasticsearch. As I am very new to this, I found esbulk and started with it. I had inserted 45 million entries when it became too slow, and I had no option but to stop it right there.
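esbulk works by grouping documents into Elasticsearch `_bulk` requests; the batching idea itself is simple and can be sketched like this (a generic helper of my own, not esbulk's actual code):

```python
def batched(lines, size=5000):
    """Group an iterable of JSON lines into fixed-size batches for bulk insert."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly short batch
        yield batch
```

Each yielded batch would become the body of one `_bulk` request instead of 53 million single-document inserts.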
Then I came up with the idea of using `tail -n <number of remaining entries>` on the file and inserting the rest back. I did it successfully. Now I can say I kind of know big data… 🙂 Feeling happy.
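The same resume trick can be done in Python: skip the lines that already made it in and stream the rest, which is equivalent to the `tail` approach. A sketch, assuming you know the count of already-inserted entries:

```python
import itertools

def remaining_lines(path, already_inserted):
    """Lazily yield the lines after the first `already_inserted` ones."""
    with open(path, encoding="utf-8") as f:
        yield from itertools.islice(f, already_inserted, None)
```

Because it streams lazily, it never loads the 30GB file into memory.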
It is really interesting to use a remote cloud server as a personal computer. I am using a DigitalOcean SSD cloud instance; it is very fast and one of the best choices. You can play games and do your stuff, the network speed is 1Gbps, and it is really awesome. Follow this tutorial: https://www.digitalocean.com/community/tutorials/how-to-setup-vnc-for-ubuntu-12
Completed coding of the recursive crawler. It was fun: a lot of hard work, some meditation, and lots of Google, but I finally did it. My friend Abhijeet asked me to make a recursive crawler, and while wondering how to do that, I came up with the idea of keeping two lists:
1. processed list (all crawled URLs are stored here)
2. unprocessed list (all new URLs are stored here)
Now, if a new URL already exists in either of these lists, skip it and move on. Happy crawling, guys… 🙂
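The two-list idea above can be sketched as a small loop (`fetch_links` stands in for whatever function returns the URLs found on a page; the names are mine, not the repo's):

```python
def crawl(start_url, fetch_links):
    """Recursive crawl flattened into a loop over two lists."""
    processed = set()          # 1. all crawled URLs end up here
    unprocessed = [start_url]  # 2. all new URLs wait here
    while unprocessed:
        url = unprocessed.pop(0)
        processed.add(url)
        for link in fetch_links(url):
            # skip any URL that already exists in either list
            if link not in processed and link not in unprocessed:
                unprocessed.append(link)
    return processed
```

The membership checks are what keep the crawler from looping forever on pages that link back to each other.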
This program does the following things:
- stores data in MongoDB
- parses the HTML into page title, meta data, and meta keywords
- error handling keeps it from breaking when a page request fails
- it does not follow any domain other than the given one
Here is the link https://github.com/vishvendrasingh/crawler.git
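For a rough idea of two of those features, here is how the title/meta parsing and the same-domain rule could look with just the standard library (a sketch under my own names, not the repo's actual code):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class PageMeta(HTMLParser):
    """Pull out the <title> text and <meta name="keywords"> content."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "keywords":
                self.keywords = d.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def same_domain(link, root):
    """True only when `link` stays on the exact domain of `root`."""
    return urlparse(link).netloc == urlparse(root).netloc
```

Links failing `same_domain` are simply never appended to the unprocessed list.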
It is a really nice experience at the new company with Python; I did something really useful. Scraped LinkedIn and stored some really good data in the cloud using Tor and Privoxy… 🙂 Happy now.
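Privoxy normally listens on 127.0.0.1:8118 and forwards to Tor, so one way to route Python's urllib through that chain looks like this (a sketch assuming Privoxy's default listen address; not my actual scraping code):

```python
from urllib.request import ProxyHandler, build_opener

def tor_opener(privoxy="127.0.0.1:8118"):
    """Build a urllib opener that sends http/https traffic through Privoxy -> Tor."""
    handler = ProxyHandler({"http": f"http://{privoxy}",
                            "https": f"http://{privoxy}"})
    return build_opener(handler)
```

Any request made via `tor_opener().open(...)` then exits through Tor instead of your own IP, provided both daemons are running.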
An easier and more advanced implementation for storing images in MySQL with mysqli. I found so little material on storing images in a database with mysqli that I made my own script. Try it out: get the code here on GitHub.
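The mysqli script is PHP, but the pattern is the same in any language: read the file's bytes and bind them as a BLOB parameter. Here is the same idea sketched with Python's built-in sqlite3 so it runs anywhere (the `images` table name is my own, not from the script):

```python
import sqlite3

def store_image(conn, name, data):
    """Insert raw image bytes as a BLOB, bound as a parameter (never string-concatenated)."""
    conn.execute("CREATE TABLE IF NOT EXISTS images (name TEXT, data BLOB)")
    conn.execute("INSERT INTO images (name, data) VALUES (?, ?)",
                 (name, sqlite3.Binary(data)))
    conn.commit()

def load_image(conn, name):
    """Fetch the image bytes back, or None if the name is unknown."""
    row = conn.execute("SELECT data FROM images WHERE name = ?",
                       (name,)).fetchone()
    return bytes(row[0]) if row else None
```

With MySQL the column type would be `BLOB`/`LONGBLOB`, and parameter binding works the same way through mysqli's prepared statements.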
Add this to your cPanel or cloud, configure it a little, and enjoy full backups.
BRTS, the fast bus service in Ahmedabad, has the best bus drivers. They drive very fast and still save the asses of reckless guys on the road. People here also drive in the BRTS VIP corridor, where they are neither supposed to be nor permitted. Today a BRTS bus driver saved a kid by applying the brakes on time, and then waited until he had crossed the road.