Reduce job runs forever when attempting to process earthquake sample against US Zip5 shapefile #13
Just so you can see it, the Zip5 shapefile in JSON form appears as shown below (truncated mid-geometry):

```json
{
  "fields": [
    {"name": "ZIP",        "alias": "ZIP",        "type": "esriFieldTypeString", "length": 5},
    {"name": "NAME",       "alias": "NAME",       "type": "esriFieldTypeString", "length": 40},
    {"name": "ZIPTYPE",    "alias": "ZIPTYPE",    "type": "esriFieldTypeString", "length": 20},
    {"name": "STATE",      "alias": "STATE",      "type": "esriFieldTypeString", "length": 2},
    {"name": "STATEFIPS",  "alias": "STATEFIPS",  "type": "esriFieldTypeString", "length": 2},
    {"name": "COUNTYFIPS", "alias": "COUNTYFIPS", "type": "esriFieldTypeString", "length": 5},
    {"name": "COUNTYNAME", "alias": "COUNTYNAME", "type": "esriFieldTypeString", "length": 60},
    {"name": "S3DZIP",     "alias": "S3DZIP",     "type": "esriFieldTypeString", "length": 3},
    {"name": "LAT",        "alias": "LAT",        "type": "esriFieldTypeDouble"},
    {"name": "LON",        "alias": "LON",        "type": "esriFieldTypeDouble"},
    {"name": "EMPTYCOL",   "alias": "EMPTYCOL",   "type": "esriFieldTypeString", "length": 5},
    {"name": "TOTRESCNT",  "alias": "TOTRESCNT",  "type": "esriFieldTypeDouble"},
    {"name": "MFDU",       "alias": "MFDU",       "type": "esriFieldTypeDouble"},
    {"name": "SFDU",       "alias": "SFDU",       "type": "esriFieldTypeDouble"},
    {"name": "BOXCNT",     "alias": "BOXCNT",     "type": "esriFieldTypeDouble"},
    {"name": "BIZCNT",     "alias": "BIZCNT",     "type": "esriFieldTypeDouble"},
    {"name": "RELVER",     "alias": "RELVER",     "type": "esriFieldTypeString", "length": 8},
    {"name": "COLOR",      "alias": "COLOR",      "type": "esriFieldTypeDouble"}
  ],
  "hasZ": false,
  "hasM": false,
  "spatialReference": {"wkid": 4326},
  "features": [
    {
      "attributes": {
        "BOXCNT": 0.0,
        "COUNTYFIPS": "56001",
        "NAME": " ",
        "ZIP": "820MX",
        "COLOR": 99.0,
        "COUNTYNAME": "ALBANY",
        "STATEFIPS": "56",
        "TOTRESCNT": 0.0,
        "LON": -105.947315216,
        "RELVER": "1.12.4",
        "EMPTYCOL": " ",
        "BIZCNT": 0.0,
        "STATE": "WY",
        "S3DZIP": "820",
        "LAT": 41.6488456726074,
        "MFDU": 0.0,
        "ZIPTYPE": "FILLER",
        "SFDU": 0.0
      },
      "geometry": {
        "rings": [
          [
            [-105.952858, 41.646918],
            [-105.954499, 41.646716999999995],
            ...
```

---
Hi @jmadison222. A couple of questions...

Getting stuck at 90% seems odd. There isn't any extra verbose logging. Have you looked at the MapReduce logs of your cluster?

---
Thanks for the quick reply! Pulled on 7/3. The query is:

```sql
SELECT zip5_CA.CountyName, count(*) cnt FROM zip5_CA
JOIN earthquakes
WHERE ST_Contains(zip5_CA.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY zip5_CA.CountyName
ORDER BY cnt desc;
```

where the zip5_CA table is:

```sql
create table zip5_CA as
```

and the zip5 table is:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS zip5 (
  State string,
  CountyName string,
  Zip string,
  BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/jmxxxx/data/zip5/no_format';
```

The earthquakes table is the one provided in the sample. I did check the logs. The only questionable message is:

This message occurs quite a few times. I'm on CDH4 in case that matters. James

---
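(Aside on what that query is doing: with no `ON` clause, Hive effectively evaluates the join as a cross product and runs `ST_Contains` — a point-in-polygon test — for every polygon/point pair. A toy sketch of that test using ray casting, in standalone Python; this is an illustration, not the Esri geometry library's actual implementation.)

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting point-in-polygon test: count how many polygon
    edges a horizontal ray from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the ray's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Unit square around the origin
square = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
print(point_in_ring(0.0, 0.0, square))   # True
print(point_in_ring(2.0, 0.0, square))   # False
```

Running this once is cheap; the cost in the Hive job comes from running it for every point against every ZIP polygon.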
OK, so basically the same query. The deprecation warning can be ignored. I'm more interested in the logs you see by going through the resource manager, not what you're getting through the Hive command line. What version of Hadoop are you running?

---
Hadoop version is: Hadoop 2.0.0-cdh4.7.0. Forgive my ignorance, but where would I find the resource manager log? Is it at the location associated with hadoop.log.dir when I do "ps -ef | grep jobtracker"? In that location I see these files of interest:

Sorry if I'm in the wrong logging location. Where should I be? Thanks! James

---
That's not exactly what I'm looking for, but no worries. When you run the Hive job, you should be given a URL you can use to track the MapReduce job. Mine looks like this...

and you're looking specifically for this line...

You're using CDH 4.7, so it may actually be running MapReduce v1. Either way, you should see something like this.

---
Found it! Thanks for the detailed instructions. BTW, it finished in 10 hours, but that can't be right. How do I get the log to you in some sane form on this forum? If I paste it, it's a nightmare.

---
Yeah, 10 hours is crazy. When did you pull the spatial-framework library, and how many machines are you running? Surround your log info with ```, like this...

```
Hive> select count(*) from earthquakes;
Total MapReduce jobs = 1
Launching Job 1 out of 1
```

I just updated your comments with the same.

---
Got the libraries on 7/3. Happy to re-pull just to get things off the plate. Thanks for helping the new guy with all these basics too! The log doesn't work well as text, so I'll send it with markup:

---
Oh, sorry, and: 4 nodes, each with 4 cores and 128 GB of memory.

---
Ah, just noticed you already told me when you pulled. I updated the libraries in this repository on 7/7, referencing this blog post. My gut feeling is that you will see things running much faster if you pull the latest.

---
Excellent. I'll get the latest. Stay tuned. BTW, we're seeking to replicate a reverse geocoding process that runs in SAS in 10 hours, processing our telematics data against the USPS Zip5 shapefile. Thus, we have a benchmark to beat. We're hoping for 2 hours. Point is, though: if you have any performance enhancements in the queue and were looking for a customer to run your stuff so you can have one of those "one customer took a job running in X and brought it down to Y" stories that vendors love, I'd be happy to be on the bleeding edge if it gives us speed. But let me try the new code first and let you know.

---
Very cool - thank you. Those are exactly the stories we're looking for.

---
I'd be interested to see how it works with thousands of polygons. You might look into developing a custom MapReduce job like this sample, where you'll have the added benefit of a spatial index on top of the polygons.

---
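(To illustrate why a spatial index helps here: instead of testing every point against every polygon, you first index polygon bounding boxes so each point is only tested against nearby candidates. A toy grid-based index in Python follows; the class and IDs are hypothetical, and this is a simplification of what the sample's QuadTree-style index does.)

```python
from collections import defaultdict

class GridIndex:
    """Toy spatial index: buckets polygon bounding boxes into
    fixed-size grid cells so point lookups only scan nearby polygons."""
    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, poly_id, bbox):
        # Register the polygon in every grid cell its bbox overlaps
        (xmin, ymin, xmax, ymax) = bbox
        cx1, cy1 = self._cell(xmin, ymin)
        cx2, cy2 = self._cell(xmax, ymax)
        for cx in range(cx1, cx2 + 1):
            for cy in range(cy1, cy2 + 1):
                self.cells[(cx, cy)].append(poly_id)

    def candidates(self, x, y):
        # Only polygons whose bbox touches this cell need the
        # expensive point-in-polygon test
        return self.cells.get(self._cell(x, y), [])

index = GridIndex(cell_size=1.0)
index.insert("zip_82070", (-106.0, 41.0, -105.5, 42.0))   # Wyoming-ish bbox
index.insert("zip_90210", (-118.5, 34.0, -118.3, 34.2))   # California-ish bbox
print(index.candidates(-105.9, 41.6))  # ['zip_82070']
```

With ~40,000 ZIP polygons nationwide, each point then needs the full `ST_Contains` test against only a handful of candidates instead of all of them.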
hadoop-0.20, so definitely MR1. Generally, if a reduce stage zooms up to 90% and then stalls for a long time, there could be a horrific shuffle operation in there. There may be potential to implement a combiner (I don't know your use case, and it's not always an option). The combiner would reduce the amount of data to be shuffled prior to reduction, so it generally helps with long-running reduces.

Also, I noticed your spilled records number... Reduce input records are 1:1 with Map output records: 92,634 (good, they should match). Might look at information like this: or this: (or do other similar internet searching about how to manage memory and lower record spillage)

---
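(To illustrate the combiner idea above: a combiner runs the reduce logic locally on each mapper's output before the shuffle, so far fewer records cross the network. A minimal word-count-style sketch in plain Python; the county values are made up and this is not tied to the Hive job itself.)

```python
from collections import Counter

def mapper(records):
    """Emit (key, 1) for every record, like a map phase would."""
    return [(county, 1) for county in records]

def combiner(pairs):
    """Locally pre-aggregate one mapper's output before the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

# One mapper's output: 6 records, but only 2 distinct keys
map_output = mapper(["ALBANY", "ALBANY", "KERN", "ALBANY", "KERN", "KERN"])
combined = combiner(map_output)

print(len(map_output))  # 6 records would hit the shuffle without a combiner
print(len(combined))    # 2 records hit the shuffle with one
```

For a GROUP BY with few distinct keys (58 CA counties) and many input rows, this kind of pre-aggregation can shrink the shuffle dramatically.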
@ddkaiser It is MRv1 (from the logs he posted), but he's running Hadoop 2.0 (cdh4.7). cdh4 ships with YARN disabled by default. Also, he's running Hive queries, so I don't know what options he has for tuning the query.

---
Great thoughts. A few things:

Thanks!

---
Not that I know of. We have YARN running on our cluster and it has been fine. I vaguely remember running into some issues upgrading from MRv1 to YARN using CDH 4.1, but they were only configuration issues. Hive is becoming much faster, so if you were to go cutting edge (Hive .13), you may get even better results.

Good strategy. I know that HUE will work (it might need some extra configuration), so hopefully the performance is acceptable. If it isn't, there may also be room for improvement on our side, so you don't have to go with a more complicated route.

---
@climbage |
I'm using Hive to do reverse geocoding (RG). I have the earthquake sample working. I then attempt to do the RG on the earthquake data using the Zip5 shape file for the entire U.S. The mapping step runs in seconds. Then the reduce step runs to over 90% completion in a few seconds, but never finishes from there. Hoping to find out why.
I converted the shapefile using ArcMap, per the instructions here. That worked fine. I also selected just state = 'CA' to make it somewhat manageable, so the RG I'm doing is just against the CA table (but the same problem happens with the whole country).
At a minimum, is there some type of log setting I can set to verbose and then check a log to see what the code is doing?
All assistance appreciated!